Have you ever looked up at the sky and wondered: is our planet Earth the only place with life on it? The answer is probably not. This raises another question: which of the planets out there can actually support life? The time may not be far off when humans, too, might be able to, or have to, start living on planets other than Earth.
This is where our project comes in. In 2009, NASA launched the Kepler Space Observatory, a NASA-built satellite for space observation. As described by NASA, the satellite is dedicated to finding exoplanets, or extrasolar planets, which by definition are planets outside the Solar System. The ultimate goal of this exploration is the possibility of finding other habitable planets besides our own. The telescope has remained in service on an extended mission since 2014, and NASA regularly releases its data for public research.
We took this data as the working dataset of our project. By manipulating it, we played the role of amateur astronomers, hoping to find out which key components identify an astral body as an exoplanet, and then to predict the status of other astral bodies based on what we learned.
Our project research questions are divided into 5 main parts:
The objectives for our project will be:
Variables are separated into different sections for better understanding.
Descriptions are taken from the NASA Exoplanet Science Institute:
https://exoplanetarchive.ipac.caltech.edu/docs/API_kepcandidate_columns.html
Our focus is to build a model for classifying an astral body. We decided to use the following three categories of variables to build our predictive model:
Identification Variable
Transit Properties
Stellar Parameters
Based on our literature review, we finalized our variables as follows:
Dependent Variable:
Independent Variables:
The data is downloaded from the NASA Exoplanet Archive webpage (https://exoplanetarchive.ipac.caltech.edu/docs/data.html). The data comes in a well-formatted, tidy form, so no heavy cleaning is required. We manipulate the dataset based on our own research interests: filtering out irrelevant variables, dropping NULL entries, assigning the correct type to each column, and creating dummy variables for categorical variables (dummy-variable creation is done in the regression part).
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
nasarawdata=pd.read_csv('Nasa.csv')
nasarawdata.shape
regvarlist=[#Project Disposition Columns
'koi_score',
#Transit, and Planet Properties
'koi_period','koi_duration','koi_depth',
'koi_srho','koi_fittype','koi_prad','koi_sma','koi_incl','koi_teq','koi_dor',
#Stellar Parameters
'koi_steff','koi_slogg','koi_smet','koi_srad','koi_smass']
nasadataregvar=nasarawdata[regvarlist].copy()
nasadataregvar.dtypes
All columns are read in with the correct types.
#Delete observations that contain any null values
nasadatareg=nasadataregvar.dropna()
nasadatareg.shape
In total, we have 7,989 valid observations, with 1 dependent variable and 15 independent variables.
From a preliminary literature review, we learned the meaning of all the columns, which lets us address the first question: what proportion of exoplanets are confirmed in the existing literature but disconfirmed by measurements from Kepler? And the other way round?
We used raw data to conduct this analysis.
kepmatchstat=nasarawdata[['kepid','koi_disposition','koi_pdisposition']].copy()
kepmatchstat.groupby(["koi_disposition", "koi_pdisposition"]).size()
Because, to be conservative, the Kepler mission assigns no 'CONFIRMED' value to the koi_pdisposition column, we treat as mismatches those observations with 'CONFIRMED' in koi_disposition and 'FALSE POSITIVE' in koi_pdisposition, as well as those with 'FALSE POSITIVE' in koi_disposition and 'CANDIDATE' in koi_pdisposition.
The Python output indicates that there are in total 8 mismatching cases out of 9,564 observations, showing extremely close agreement between the archive dispositions and Kepler's own exoplanet-vetting algorithm. This result gives us confidence to use literature-derived key parameters to reconstruct (or approximate) the algorithm that the Kepler mission might use.
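The mismatch rule above can be written directly as a boolean filter. A minimal sketch on toy data with the same column names (the real analysis would apply the filter to kepmatchstat):

```python
import pandas as pd

# Toy stand-in for the two disposition columns: koi_disposition is the
# archive label, koi_pdisposition is Kepler's own pipeline label.
df = pd.DataFrame({
    "koi_disposition":  ["CONFIRMED", "CONFIRMED", "FALSE POSITIVE", "CANDIDATE"],
    "koi_pdisposition": ["CANDIDATE", "FALSE POSITIVE", "CANDIDATE", "CANDIDATE"],
})

# Mismatch: archive says CONFIRMED but Kepler says FALSE POSITIVE,
# or archive says FALSE POSITIVE but Kepler says CANDIDATE.
mismatch = (
    ((df["koi_disposition"] == "CONFIRMED") & (df["koi_pdisposition"] == "FALSE POSITIVE"))
    | ((df["koi_disposition"] == "FALSE POSITIVE") & (df["koi_pdisposition"] == "CANDIDATE"))
)
print(mismatch.sum())  # 2 mismatching rows in this toy example
```

On the real dataset the same filter would return the 8 mismatching cases counted above.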
nasadatareg.head(5)
nasadatareg.describe().T
The descriptive statistics give us a brief understanding of our data. The percentiles tell us that the data have heavy skewness and extreme values. We will explore more insights in the following analyses.
hisvarlist=['koi_period','koi_duration','koi_depth','koi_srho','koi_prad','koi_sma','koi_incl','koi_teq','koi_dor',
'koi_steff','koi_slogg','koi_smet','koi_srad','koi_smass']
for i in hisvarlist:
    plt.figure(figsize=(8,5))
    plt.title("{}".format(i),fontsize=16)
    plt.hist(nasadatareg[i],color='cyan',edgecolor='k')
    plt.grid(True)
    plt.xlabel(i,fontsize=14)
    plt.show()
We visualized each continuous column by plotting its histogram and found that most of the variables are heavily skewed.
These visualizations suggest that, for our modeling purpose, applying certain transformations to the independent variables may work better than using them in their original form.
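As one such candidate transformation, a log transform often tames heavy right-skew; np.log1p handles zeros safely. A sketch on synthetic skewed data (not the Kepler columns themselves), comparing sample skewness before and after:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.lognormal(mean=0.0, sigma=1.5, size=10_000)  # heavily right-skewed

def skewness(a):
    # Sample (population-style) skewness: third standardized moment.
    a = np.asarray(a, dtype=float)
    return float(np.mean((a - a.mean()) ** 3) / a.std() ** 3)

# The skewness shrinks dramatically after the log1p transform.
print(skewness(x), skewness(np.log1p(x)))
```

The same comparison could be run on each heavily skewed independent variable before modeling.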
plt.figure(figsize=(8,5))
plt.title("koi_score",fontsize=16)
plt.hist(nasadatareg['koi_score'],color='cyan',edgecolor='k')
plt.grid(True)
plt.xlabel("koi_score",fontsize=14)
plt.show()
The vast majority of koi_score values are 0 or 1.
This departure from a normal distribution indicates that a linear regression model might not be very robust.
Moreover, the nearly dichotomized koi_score suggests that a logistic regression model could be a good alternative for our purpose.
for c in nasadatareg.columns[1:]:
    plt.figure(figsize=(8,5))
    plt.title("{} vs. \nkoi_score".format(c),fontsize=16)
    plt.scatter(x=nasadatareg[c],y=nasadatareg['koi_score'],color='blue',edgecolor='k')
    plt.grid(True)
    plt.xlabel(c,fontsize=14)
    plt.ylabel('koi_score',fontsize=14)
    plt.show()
For transit property plots (plots before koi_steff vs. koi_score), most are 'C'-shaped, with plenty of observations clustered around the upper limit (koi_score = 1) and the lower limit (koi_score = 0).
For stellar parameter plots (plots from koi_steff vs. koi_score onward), most are 'I'-shaped. As in the transit property plots, many observations cluster around the upper and lower limits; the difference is that the observations are more scattered than in the transit property plots.
The scatterplots also show some outliers.
These are not good signs for linear regression: the scatterplots reveal hardly any linear trend describing the relationship between koi_score and each parameter (a line perpendicular to the x-axis is not a feasible trend), and our linear model may be biased by the outliers. However, the clustering is a good sign for applying logistic regression.
nasadatareg2=pd.get_dummies(nasadatareg)
nasadatareg3=nasadatareg2.rename(columns={"koi_fittype_LS+MCMC": "koi_fittype_LS_MCMC"})
We also want to know about the associations among the independent variables.
nasadatareg3.head()
corr = nasadatareg3[nasadatareg3.columns[:-2]].corr()
corr
A correlation matrix is a basic way to view pairwise correlation coefficients, but it makes it somewhat hard to spot which pairs are highly correlated. We decided to use pairwise scatter plots and a heatmap for more informative displays.
from seaborn import pairplot
pairplot(nasadatareg3[nasadatareg3.columns[:-2]])
We generated a visualized correlation matrix of scatter plots using pairplot in Seaborn. From this plot matrix we can identify some correlated pairs:
from statsmodels.graphics.correlation import plot_corr
sns.set(rc={'figure.figsize':(12,7)})
fig=plot_corr(corr,xnames=corr.columns,title='Correlation Heatmap')
The heatmap makes it much easier to find which pairs are highly correlated: grids with deep colors (deep red or deep blue) indicate highly correlated pairs.
The highly correlated pairs are:
We decided to keep all of them in the saturated model even though we know this introduces a multicollinearity issue, because we consider those parameters crucial based on our literature review.
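Rather than reading the heatmap by eye, highly correlated pairs can also be extracted programmatically. A sketch on toy data, using an illustrative threshold of |r| > 0.8 (the real analysis would pass in the corr matrix computed above):

```python
import numpy as np
import pandas as pd

# Toy data: 'a' and 'b' are constructed to be highly correlated, 'c' is independent.
rng = np.random.default_rng(1)
a = rng.normal(size=500)
df = pd.DataFrame({"a": a, "b": a + 0.1 * rng.normal(size=500),
                   "c": rng.normal(size=500)})

corr = df.corr()
# Keep only the strict upper triangle so each pair appears once, then filter.
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
pairs = upper.stack()
print(pairs[pairs.abs() > 0.8])
```

Applied to the Kepler correlation matrix, this would list exactly the pairs highlighted by the heatmap.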
The statsmodels library is used to generate and test our predictive model.
import statsmodels.api as sm
import statsmodels.formula.api as smf
First, we need to specify the linear model; the '~' sign separates the dependent variable from the independent variables.
model_str = nasadatareg3.columns[0]+' ~ '+'+'.join(nasadatareg3.columns[1:-2])+'+'+nasadatareg3.columns[-1]
model_str
model=smf.ols(formula=model_str, data=nasadatareg3)
fitted = model.fit()
print(fitted.summary())
We select koi_teq and koi_fittype_LS_MCMC as examples to interpret our model.
result=pd.DataFrame()
result['pvalues']=fitted.pvalues[1:]
paramlist=nasadatareg3.columns[1:-2].tolist()
paramlist.append('koi_fittype')
result['Parameters']=paramlist
result.set_index('Parameters',inplace=True)
result['Statistically significant?']= result['pvalues'].apply(lambda x:'Yes' if x<0.05 else 'No')
result
R-squared is 0.341 and adjusted R-squared is 0.34; both mean our model is not very capable of doing the prediction work, since more than half of the dependent variance is left unexplained.
Parameters koi_srho, koi_prad, koi_dor, and koi_steff are not statistically significant at the 0.05 level; however, since we value the literature-reviewed model more and the p-values are acceptable, we decided to keep them in the model.
We have our model now, but we want to know more about how good and how robust it is. This section is mainly about residual analysis, to check whether certain linear regression assumptions hold.
for d in nasadatareg[hisvarlist]:
    plt.figure(figsize=(8,5))
    plt.title("{} vs. \nModel residuals".format(d),fontsize=16)
    plt.scatter(x=nasadatareg[d],y=fitted.resid,color='blue',edgecolor='k')
    plt.grid(True)
    xmin=min(nasadatareg[d])
    xmax=max(nasadatareg[d])
    plt.hlines(y=0,xmin=xmin*0.9,xmax=xmax*1.1,color='red',linestyle='--',lw=3)
    plt.xlabel(d,fontsize=14)
    plt.ylabel('Residuals',fontsize=14)
    plt.show()
The residual plots show some clustering. Some have an oblique line at the bottom, suggesting the residuals follow a slightly positive trend. These characteristics tell us the residuals follow an undesired pattern, and thus the homoscedasticity assumption is violated.
Also, this non-random pattern indicates that the deterministic portion of the model is not capturing all of the explanatory information affecting the residuals. Possibilities that could cause such patterns include:
Some of the plots also show residual outliers that are away from the main cluster.
plt.figure(figsize=(8,5))
p=plt.scatter(x=fitted.fittedvalues,y=fitted.resid,edgecolor='k')
xmin=min(fitted.fittedvalues)
xmax = max(fitted.fittedvalues)
plt.hlines(y=0,xmin=xmin*0.9,xmax=xmax*1.1,color='red',linestyle='--',lw=3)
plt.xlabel("Fitted values",fontsize=15)
plt.ylabel("Residuals",fontsize=15)
plt.title("Fitted vs. residuals plot",fontsize=18)
plt.grid(True)
plt.show()
This plot looks quite bad. The clear negative trend of residuals vs. fitted values again indicates that we must have omitted one or more variables from our model that could account for this trend. However, we cannot know exactly which one(s) we missed.
We applied both raw residuals and standardized residuals for normality test.
plt.figure(figsize=(8,5))
plt.hist(fitted.resid_pearson,bins=20,edgecolor='k')
plt.ylabel('Count',fontsize=15)
plt.xlabel('Standardized residuals',fontsize=15)
plt.title("Histogram of standardized residuals",fontsize=18)
plt.show()
The histogram shows that the standardized residuals are neither symmetric nor bell-shaped.
from statsmodels.graphics.gofplots import qqplot
plt.figure(figsize=(8,5))
fig=qqplot(fitted.resid_pearson,line='45',fit=True)
plt.xticks(fontsize=13)
plt.yticks(fontsize=13)
plt.xlabel("Theoretical quantiles",fontsize=15)
plt.ylabel("Sample quantiles",fontsize=15)
plt.title("Q-Q plot of standardized residuals",fontsize=18)
plt.grid(True)
plt.show()
The Q-Q plot shows that, although not perfectly aligned, the standardized residuals generally follow the diagonal reference line.
from scipy.stats import shapiro
_,p=shapiro(fitted.resid)
if p>0.05:
    #Fail to reject the null hypothesis of normality
    print("The residuals seem to follow a normal distribution")
else:
    print("The normality assumption may not hold")
Finally, we used the Shapiro-Wilk test as a formal check of normality: a p-value below the significance level means the residuals deviate from a normal distribution.
from statsmodels.stats.outliers_influence import OLSInfluence as influence
inf=influence(fitted)
(c, p) = inf.cooks_distance
plt.figure(figsize=(8,5))
plt.title("Cook's distance plot for the residuals",fontsize=16)
plt.stem(np.arange(len(c)), c, markerfmt=",")
plt.grid(True)
plt.show()
Cook's distance is used for outlier detection. As the plot shows, there is one prominent outlier in our dataframe. In future model improvement we will remove this outlier and rerun the model to see how much it improves.
from statsmodels.stats.outliers_influence import variance_inflation_factor as vif
#Compute VIF on the independent-variable columns only (skip koi_score)
exog=nasadatareg3[nasadatareg3.columns[1:]].values
for i in range(exog.shape[1]):
    v=vif(exog,i)
    print("Variance inflation factor for {}: {}".format(nasadatareg3.columns[1:][i],round(v,2)))
There are two parameters with VIF > 10, indicating significant multicollinearity. Because we intentionally included highly correlated exogenous variables in our model, this result is not beyond our expectation.
We successfully built a model for predicting the koi_score of an astral body from its own properties. However, this model has several flaws that could be corrected in our future analysis.
Recall the histogram of the dependent variable (koi_score): the vast majority of observations are 0 or 1. We consider that logistic regression might perform better than the linear regression model.
nasadatalogvar=nasarawdata[regvarlist].copy()
For logistic regression, the outcomes are binary values (e.g., 1 or 0, Yes or No). Therefore, we dichotomize koi_score according to the criterion: a koi_score value above 0.5 is assigned 1, otherwise 0.
#Dichotomize koi_score
nasadatalogvar['score_bi']=nasadatalogvar['koi_score'].apply(lambda x:0 if x<=0.5 else 1)
#Row-wise drop null values
nasadatalog=nasadatalogvar.dropna()
nasadatalog['score_bi'].value_counts()
sns.countplot(x='score_bi', data=nasadatalog, palette='hls')
plt.show()
In total, we have 7,989 observations with complete values.
The two groups look well balanced, so there is no need for oversampling.
meancomparison=nasadatalog.groupby('score_bi').mean()
for i in meancomparison.columns[1:]:
    plt.figure(figsize=(8,5))
    plt.title("{} vs. \nscore_bi".format(i),fontsize=16)
    plt.bar(meancomparison.index,meancomparison[i],color='blue',edgecolor='k')
    plt.grid(True)
    plt.ylabel(i,fontsize=14)
    plt.xlabel('score_bi',fontsize=14)
    plt.show()
We plotted bar charts of the two score-group means to do a naked-eye evaluation of whether, for each variable, a distinguishable difference exists between the two groups.
Considering the magnitude of the y-axis, we can only conclude that the mean value of koi_depth differs substantially between the two score groups.
For more rigorous conclusions, we rely on two-sample t-tests to determine, for each independent variable, whether a statistically significant difference exists between the two groups.
Before applying the two-sample t-test, we have to determine whether the variances of the two groups are equal.
Recall the histogram of each independent variable: most are heavily skewed, and the Levene test is robust in this situation.
The process is as follows:
import scipy.stats as ss
from scipy.stats import levene
nasadatalog.groupby('score_bi').mean()
twos_ttest=[]
for i in hisvarlist:
    #Variables 'koi_steff','koi_smet' and 'koi_smass' are not so heavily skewed, set center argument to 'median'
    if i in ['koi_steff','koi_smet','koi_smass']:
        if levene(nasadatalog[nasadatalog['score_bi']==1][i],
                  nasadatalog[nasadatalog['score_bi']==0][i],center='median').pvalue<0.05:
            #If the p-value is less than 0.05, variances of the two groups are considered unequal
            twos_ttest.append([i,ss.ttest_ind(a=nasadatalog[nasadatalog['score_bi']==1][i],
                                              b=nasadatalog[nasadatalog['score_bi']==0][i], equal_var=False).pvalue])
        else:
            #Otherwise the variances of the two groups are considered equal
            twos_ttest.append([i,ss.ttest_ind(a=nasadatalog[nasadatalog['score_bi']==1][i],
                                              b=nasadatalog[nasadatalog['score_bi']==0][i], equal_var=True).pvalue])
    #The remaining variables are heavily skewed, set center argument to 'trimmed'
    else:
        if levene(nasadatalog[nasadatalog['score_bi']==1][i],
                  nasadatalog[nasadatalog['score_bi']==0][i],center='trimmed').pvalue<0.05:
            twos_ttest.append([i,ss.ttest_ind(a=nasadatalog[nasadatalog['score_bi']==1][i],
                                              b=nasadatalog[nasadatalog['score_bi']==0][i], equal_var=False).pvalue])
        else:
            twos_ttest.append([i,ss.ttest_ind(a=nasadatalog[nasadatalog['score_bi']==1][i],
                                              b=nasadatalog[nasadatalog['score_bi']==0][i], equal_var=True).pvalue])
df_twos_ttest=pd.DataFrame(twos_ttest,columns=['Parameter','p-value'])
df_twos_ttest['p<0.05']=df_twos_ttest['p-value'].apply(lambda x:'Yes' if x <0.05 else 'No')
df_twos_ttest.set_index('Parameter')
koi_srho, the stellar density, is the only variable for which no statistically significant difference exists between the two score groups.
As in the linear model, we value the literature-reviewed model more, so this variable is still included in our saturated model.
We use the Logit model from the statsmodels discrete_model library for the logistic regression.
import statsmodels.discrete.discrete_model as smdd
nasadatalog2=pd.get_dummies(nasadatalog)
varlist=['koi_period','koi_duration','koi_depth','koi_srho','koi_prad','koi_sma','koi_incl','koi_teq','koi_dor',
'koi_steff','koi_slogg','koi_smet','koi_srad','koi_smass','koi_fittype_LS+MCMC']
X=sm.add_constant(nasadatalog2[varlist])
y=nasadatalog2['score_bi']
logit_model1=smdd.Logit(y,X)
result=logit_model1.fit()
print(result.summary2())
To our surprise, the p-value for koi_srho is 0.9306, which is so large that this variable contributes almost nothing to the outcome classification. We decided to remove it from the saturated model and rerun the reduced model.
varlist2=['koi_period','koi_duration','koi_depth','koi_prad','koi_sma','koi_incl','koi_teq','koi_dor',
'koi_steff','koi_slogg','koi_smet','koi_srad','koi_smass','koi_fittype_LS+MCMC']
X2=sm.add_constant(nasadatalog2[varlist2])
y2=nasadatalog2['score_bi']
logit_model2=smdd.Logit(y2,X2)
result=logit_model2.fit()
print(result.summary2())
The AIC decreased and the log-likelihood did not change, meaning the reduced model is better than the saturated model. Although koi_sma is not significant, we consider its p-value acceptable and keep it in our finalized model.
Using koi_steff and koi_fittype as two examples, the interpretations of our model are as follows.
We also want to know how well our logistic model performs. One verification method is to use machine learning tools and measure the prediction accuracy.
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score
from sklearn.metrics import roc_curve
X_train, X_test, y_train, y_test = train_test_split(X2, y2, test_size=0.3, random_state=111111)
logreg = LogisticRegression(solver='liblinear')
logreg.fit(X_train, y_train)
y_pred = logreg.predict(X_test)
print('Accuracy of logistic regression classifier on test set: {:.2f}'.format(logreg.score(X_test, y_test)))
The prediction accuracy is better than we expected, about 0.81.
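Accuracy alone can hide class-specific errors; a confusion matrix breaks the 81% down into correct and incorrect predictions per class. A sketch using the same sklearn API on toy data (the real analysis would pass y_test and logreg.predict(X_test)):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(3)
X = rng.normal(size=(1000, 2))
# Labels driven mostly by the first feature, plus noise.
y = (X[:, 0] + 0.5 * rng.normal(size=1000) > 0).astype(int)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)
clf = LogisticRegression(solver='liblinear').fit(X_tr, y_tr)

# Rows are true classes, columns are predicted classes.
cm = confusion_matrix(y_te, clf.predict(X_te))
print(cm)
```

The off-diagonal cells show how many false positives and false negatives the classifier produces, which matters if the two error types have different costs.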
Another way to visualize how well the model in prediction is to plot the ROC Curve.
The receiver operating characteristic (ROC) curve is another common tool used with binary classifiers. The dotted line represents the ROC curve of a purely random classifier; a good classifier stays as far away from that line as possible (toward the top-left corner).
logit_roc_auc = roc_auc_score(y_test, logreg.predict(X_test))
fpr, tpr, thresholds = roc_curve(y_test, logreg.predict_proba(X_test)[:,1])
plt.figure()
plt.plot(fpr, tpr, label='Logistic Regression (area = %0.2f)' % logit_roc_auc)
plt.plot([0, 1], [0, 1],'r--')
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('Receiver operating characteristic')
plt.legend(loc="lower right")
plt.show()
The last objective is to use our regression model to predict the undetermined koi_score values, through which we can classify whether a Kepler object of interest is a candidate or not.
#Finding the rows which have missing value for koi_score:
miss_score=nasarawdata[nasarawdata.koi_score.isnull()]
to_predict=miss_score[regvarlist].copy()
#Keep only the rows whose sole missing value is the koi_score column.
to_predict=to_predict[to_predict.isnull().sum(axis=1)<2]
to_predict.head()
We notice something anomalous: the koi_fittype here is 'MCMC', a type that does not exist in our modeling dataset.
to_predict.koi_fittype.value_counts()
The worst case happened. The 'for prediction' dataset has one feature completely different from the modeling dataset: the only koi_fittype present is 'MCMC', and our model has no coefficient for koi_fittype = MCMC. Therefore, we cannot use our model to predict the missing scores because of this data-structure mismatch.
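One common workaround for such a category mismatch is to reindex the prediction data's dummy columns against the training layout, filling unseen combinations with zeros; note, however, that for a wholly unseen level like 'MCMC' the resulting prediction would rest on extrapolation (the row is treated as the dummy baseline), so our conclusion above still stands. A minimal sketch with illustrative values:

```python
import pandas as pd

train = pd.DataFrame({"koi_fittype": ["LS", "LS+MCMC", "LS"]})
new = pd.DataFrame({"koi_fittype": ["MCMC", "LS"]})

train_dummies = pd.get_dummies(train)
# Reindex the new data's dummies to the training columns; the unseen
# 'MCMC' level simply produces all-zero values in the known dummy columns.
new_dummies = pd.get_dummies(new).reindex(columns=train_dummies.columns,
                                          fill_value=0)
print(new_dummies)
```

This guarantees the prediction matrix has exactly the columns the fitted model expects, at the cost of silently mapping unseen categories to the baseline.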
In conclusion, we were surprised and satisfied with the strong predictive performance of our logistic model. For a given astral body with sufficient parameters, we could use this model to predict which category the body is likely to fall into with 81% accuracy.
However, the logistic model itself, like the linear model, is still not ideal and requires improvement.
As we mentioned before, there are at least three major flaws within our dataset and modeling procedure.
Outliers
We might generate more robust models after addressing those issues.
Although we did extensive work on the literature review, we found it is not enough to improve our model further.
Figuring out proper transformations for our independent variables requires understanding the underlying physical laws, and no member of our team is familiar with them.
To further improve our model, we should perhaps seek help from a professional astronomer.